Reliability and validity of the Patient Health Questionnaire-4 scale and its subscales of depression and anxiety among US adults based on nativity

Background The burdens of anxiety and depression symptoms have significantly increased in the general US population, especially during the COVID-19 pandemic. The first step in an effective treatment for anxiety and depression disorders is screening. The Patient Health Questionnaire-4 (PHQ-4, a 4-item measure of anxiety/depression) and its subscales (PHQ-2 [a 2-item measure of depression] and Generalized Anxiety Disorder [GAD-2, a 2-item measure of anxiety]) are brief but effective mass screening instruments for anxiety and depression symptoms in general populations. However, few, if any, studies have examined the psychometric properties (i.e., reliability and validity) of the PHQ-4 and its subscales (PHQ-2 and GAD-2) in the general US adult population or by US nativity (i.e., foreign-born vs. US-born). We evaluated the psychometric properties of the PHQ-4 and its subscales in US adults, as well as the psychometric equivalence of the PHQ-4 scale by nativity.
Methods We conducted a cross-sectional survey of 5,140 adults aged ≥ 18 years. We examined the factorial validity and dimensionality of the PHQ-4 with confirmatory factor analysis (CFA). A multiple-group confirmatory factor analysis (MCFA) was used to evaluate the comparability of the PHQ-4 across nativity groups. Reliability indices were assessed. The scales' construct validity was also assessed by examining the associations of the PHQ-4 and subscale scores with sociodemographic characteristics and the 3-item UCLA Loneliness scale.
Results The internal consistencies were high for the PHQ-4 scale (α = 0.92) and its subscales of PHQ-2 (α = 0.86) and GAD-2 (α = 0.90). The CFA fit indices showed evidence for the two-factor structure of the PHQ-4. The two factors (i.e., anxiety and depression) were significantly correlated (r = 0.92). The MCFA demonstrated measurement invariance of the PHQ-4 across the nativity groups, but the model fit the data better in the foreign-born group. There were significant associations of the PHQ-4 scale and its subscales' scores with the sociodemographic characteristics and the UCLA Loneliness scale (all p < 0.001).
Conclusions The PHQ-4 and its subscales are reliable and valid measures to screen anxiety and depression symptoms in the general US adult population, especially in foreign-born individuals during the COVID-19 pandemic.

Substantial evidence suggests that screening is the first step in an effective treatment for anxiety and depression disorders [10][11][12][13][14][15]. Hence, there is a need for better screening for anxiety and depression in the general U.S. population. Detecting these disorders, especially when they are comorbid, and treating them at scale in the general adult population requires a brief instrument suitable for mass screening.
Anxiety and depression are commonly screened with a reliable and valid 4-item self-report instrument, the Patient Health Questionnaire-4 (PHQ-4) [12,13]. The PHQ-4 measures both anxiety and depression symptoms. It consists of two subscales: the PHQ-2 (a 2-item measure of depression) and the Generalized Anxiety Disorder scale (GAD-2, a 2-item measure of anxiety) [12,13]. The PHQ-4 is an ultra-brief screening instrument for symptoms; elevated scores do not establish a clinical diagnosis but indicate the need for further assessment by mental health professionals or clinicians to determine the presence of generalized anxiety disorder (GAD) and major depressive disorder (MDD) [12,13]. The PHQ-4 is the most widely used self-report screening instrument for anxiety and depression because of its ease of administration and demonstrated psychometric properties in general populations and patients [3,6,11,13].
Despite the usefulness of the PHQ-4 and its subscales (i.e., the PHQ-2 and the GAD-2) for mass screening, few, if any, studies have examined the reliability and validity of these scales in the general U.S. adult population or by U.S. nativity (i.e., foreign-born vs. US-born). The reliability and validity results reported in previous studies cannot be generalized to all U.S. populations, given differences in populations' experiences, culture, and socioeconomic status [18][19][20]. Most U.S. studies on anxiety and depression examined these disorders in the general population without considering differences between foreign-born and US-born individuals [21]. A systematic review reported that foreign-born or migrant individuals experienced significant anxiety and depression [18]. Socioeconomic differences, migration stress, and difficulty adapting to the host country's culture significantly affect their mental health [18,20]. One U.S. study reported that the association of U.S. nativity with anxiety and depression differed across ancestral origin groups, with those from Mexico, Eastern Europe, and Africa or the Caribbean having higher risks, especially foreign-born individuals who arrived at age 13 years or older [22].
The current study, therefore, aimed to examine the reliability and validity of the PHQ-4, PHQ-2, and GAD-2 in a large national sample of U.S. adults and by nativity. Specifically, we assessed the item characteristics, reliability, construct validity, and factorial structure of the PHQ-4, PHQ-2, and GAD-2. We also assessed the associations of the PHQ-4, PHQ-2, and GAD-2 scale scores with the sociodemographic characteristics of U.S. adults to further examine the validity of these scales. In addition, loneliness has been associated with mental health outcomes, including anxiety and depression [23]. Individuals experiencing loneliness are more burdened with anxiety or depression symptoms [23]. Hence, we evaluated the associations of the PHQ-4 scale and its subscales with loneliness to determine their validity in our sample. We expected the two-factor structure of the PHQ-4 scale to fit the data better than the one-factor structure. We also expected to find unequal psychometric properties of the PHQ-4 scale by U.S. nativity.

Study design and participants
The participants were a random sample of U.S. adults aged ≥ 18 years recruited through a national, anonymous, online cross-sectional survey. Recruitment and survey distribution, sponsored by the National Institutes of Health, were carried out by Qualtrics LLC between May 13, 2021, and January 9, 2022. The survey was developed and conducted in English. Qualtrics LLC oversampled low-income and rural individuals within the US-born White, Black, Hispanic, and foreign-born populations to enhance the representativeness of the study participants. The survey was distributed to 10,000 participants; 5,938 of them completed it, and 5,413 provided valid responses. The invalid responses comprised incomplete surveys and data we were unable to ascertain. We conducted a complete case analysis; therefore, 5,140 individuals with complete data were included in the analysis. We compared the sociodemographic characteristics of the complete cases with those of the participants excluded from the analysis and found no significant differences. In addition, missingness was only 5%, which is below the 10% threshold above which estimates are likely to be biased [24][25][26]. The Patient Health Questionnaire-4 (PHQ-4) scale was used to assess anxiety and depression among the participants. The survey also assessed the participants' sociodemographic characteristics and loneliness. Ethical approval for the study was obtained on December 23, 2020, from the National Institutes of Health's Institutional Review Board (IRB #000308).

Main outcomes
The PHQ-4 is a 4-item unipolar self-report scale comprising the PHQ-2 and GAD-2 subscales [12,13]. The PHQ-2 items are (1) little interest or pleasure in doing things and (2) feeling down, depressed, or hopeless. The GAD-2 items are (1) feeling nervous, anxious, or on edge and (2) not being able to stop or control worrying. Participants report how often they have been bothered by each problem in the last two weeks, with response options of not at all = 0, several days = 1, more than half the days = 2, and nearly every day = 3. The PHQ-2 and GAD-2 total scores each range from 0 to 6, and the PHQ-4 total score ranges from 0 to 12 [12,13]. A total score of ≥ 3 on a scale indicates anxiety symptoms (GAD-2), depression symptoms (PHQ-2), or both anxiety and depression symptoms (PHQ-4).
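The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `score_phq4`, its argument names, and the keys of the returned dictionary are hypothetical, and the cutoff of 3 is applied to every scale exactly as stated in the text.

```python
from typing import Dict, Sequence, Union

CUTOFF = 3  # screening threshold stated in the text for each scale


def score_phq4(phq2: Sequence[int], gad2: Sequence[int]) -> Dict[str, Union[int, bool]]:
    """Sum the two PHQ-2 and two GAD-2 responses (each coded 0-3) and apply the cutoff."""
    for name, items in (("phq2", phq2), ("gad2", gad2)):
        if len(items) != 2 or any(v not in (0, 1, 2, 3) for v in items):
            raise ValueError(f"{name} must hold two responses coded 0-3")
    phq2_total = sum(phq2)                # depression subscale, range 0-6
    gad2_total = sum(gad2)                # anxiety subscale, range 0-6
    phq4_total = phq2_total + gad2_total  # full scale, range 0-12
    return {
        "phq2": phq2_total,
        "gad2": gad2_total,
        "phq4": phq4_total,
        "depression_flag": phq2_total >= CUTOFF,
        "anxiety_flag": gad2_total >= CUTOFF,
        "phq4_flag": phq4_total >= CUTOFF,
    }


if __name__ == "__main__":
    # Hypothetical respondent: PHQ-2 answers 1 and 2, GAD-2 answers 0 and 1.
    print(score_phq4(phq2=(1, 2), gad2=(0, 1)))
```

Passing the subscale responses as named arguments avoids depending on the item order, which differs between questionnaires and factor-analysis output.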

Exposures
The 3-item UCLA Loneliness scale (short version) was used to measure loneliness among our survey participants [23,27,28]. The participants were asked to respond to the following three questions: (1) How often do you feel that you lack companionship? (2) How often do you feel left out? and (3) How often do you feel isolated from others? The response options for each question were 1 = hardly ever, 2 = some of the time, and 3 = often. The total score ranges from 3 to 9 [23,27,28]. Previous studies provided evidence of the reliability (alpha values ranged from 0.72 to 0.91) and validity (r = 0.82) of the 3-item UCLA Loneliness scale (short version) [23,27,28]. We found a similar alpha value of 0.88 for the scale in our study.
Existing studies identify sociodemographic characteristics such as age, nativity, race/ethnicity, sexual and gender identity, level of education, marital status, employment, and income as risk factors for anxiety and depression [6,13,22,29]. Hence, we included these sociodemographic characteristics in our study to evaluate their associations with the PHQ-4 scale and its subscales of PHQ-2 and GAD-2.

Statistical analysis
STATA/SE version 16 [30] and Mplus version 8.6 [31] were used to perform the statistical analyses. STATA was used for all analyses, and Mplus was additionally used for the one-factor and two-factor structure analyses. We analyzed the items' frequency distributions and descriptive statistics for the PHQ-4, PHQ-2, and GAD-2. We computed summary statistics to determine each item's mean, standard deviation, skewness, and kurtosis. We used the skewness, kurtosis, quantile-quantile plot (Q-Q plot), and standardized normal probability or probability-probability plot (P-P plot) to examine the normality of the distributions. We also evaluated the items for missing data. Furthermore, we examined the internal consistencies of the PHQ-4, PHQ-2, and GAD-2 using Cronbach's alpha (α) to determine their reliability [32,33]. Alpha values of at least 0.70 are considered satisfactory or desirable [32][33][34]. Additionally, we computed the composite reliability of the constructs (also known as Jöreskog's rho) [35,36].
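For readers who want to see how the internal consistency index is obtained, the following sketch computes Cronbach's alpha from item-level responses using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of the total score). It is an illustration in plain Python, not the STATA procedure used in the study; the function name `cronbach_alpha` and the example responses are hypothetical.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; items[i][j] is respondent j's score on item i."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if n < 2 or any(len(col) != n for col in items):
        raise ValueError("every item needs scores from the same n >= 2 respondents")
    # Total score per respondent (sum across items).
    totals = [sum(col[j] for col in items) for j in range(n)]
    total_var = variance(totals)  # sample variance of the total score
    if total_var == 0:
        raise ValueError("total score has zero variance")
    item_var_sum = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


if __name__ == "__main__":
    # Hypothetical responses from six people to the two GAD-2 items (coded 0-3).
    gad_item1 = [0, 1, 3, 2, 0, 3]
    gad_item2 = [0, 2, 3, 2, 1, 3]
    print(round(cronbach_alpha([gad_item1, gad_item2]), 3))
```

With only two items per subscale, alpha depends entirely on the correlation between the two items, which is one reason the composite reliability is reported alongside it.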
We examined the factorial validity and dimensionality of the PHQ-4 with confirmatory factor analysis (CFA). We evaluated a 2-dimensional structure (i.e., GAD-2 vs. PHQ-2) and a 1-dimensional structure (i.e., the PHQ-4 total score) of the PHQ-4 by fitting two factor models with the maximum likelihood (ML) method, which is an effective and robust estimator for large samples and normally distributed data [37]. We computed 95% confidence intervals (95% CIs) for the factor loadings. We assessed the two factors' convergent and discriminant validities to evaluate their inter-correlation. Evidence of inter-correlation suggests convergent validity, while weak or no inter-correlation indicates discriminant validity [38][39][40][41]. We used the average variance extracted (AVE) and the squared correlation (SC) to determine the convergent and discriminant validities [38][39][40][41]. The AVE represents the average amount of variance that a latent construct explains in its indicators relative to the indicators' total variance, including the variance due to measurement error [38][39][40][41]. AVE values greater than 0.50 (i.e., 50%) demonstrate evidence of convergent validity, indicating that the latent construct explains more than 50% of the indicator variance [38][39][40][41]. There is evidence of discriminant validity when the AVE value is greater than or equal to the SC between the two latent constructs, suggesting that each latent construct shares more variance with its own indicators than with the other construct [38][39][40][41].
To examine the comparability of the factor structure of the PHQ-4 across nativity groups (US-born vs. foreign-born), we conducted a multiple-group confirmatory factor analysis (MCFA). In particular, we evaluated the consistency of the PHQ-4 scale across the two groups. We examined and compared three increasingly restrictive models (i.e., configural, metric, and scalar measurement invariance models) with the MCFA, following approaches used and recommended by other researchers [6,29,42].
We first examined configural measurement invariance by fitting an unconstrained model in which all parameters were freely estimated in each group, to determine whether the pattern of factor loadings was the same in the two nativity groups and whether the model fit equally well in each group. Once configural invariance was established, we examined metric measurement invariance. In this second model, factor loadings were constrained to be equal between the two groups. Once metric invariance was established, scalar measurement invariance (i.e., the equal intercepts model) was examined by constraining both the item intercepts and the factor loadings to be equal. The metric invariance model was compared with the configural invariance model, and the scalar invariance model was compared with the metric invariance model. A non-significant test suggests that the model under consideration fits the data just as well as the model estimated in the previous step of invariance testing.
Overall fit and model comparisons were evaluated using five criteria or indices: the Root Mean Square Error of Approximation (RMSEA), the Standardized Root Mean Square Residual (SRMR), the Comparative Fit Index (CFI), the Tucker-Lewis Index (TLI), and the likelihood ratio test (LRT). RMSEA and SRMR values less than 0.08 suggest acceptable model fit, and values less than 0.05 indicate good model fit [6,13]. RMSEA values between 0.08 and 0.1 suggest marginal fit [6,13,43,44]. The RMSEA was estimated with its 95% CI. CFI and TLI values greater than 0.95 indicate good model fit, while values > 0.90 denote acceptable model fit [6,13]. For the model comparisons, the LRT was used to compare a more restricted model (i.e., the nested or simpler model) with a less restricted model (i.e., the more complex or full model). A statistically significant test suggests that the less restricted model fits the data better than the more restricted model; otherwise, the more restricted model fits the data just as well as the less restricted model [45][46][47][48][49].
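The cutoffs listed above can be written as a small set of classification rules. The sketch below is only an illustration of those thresholds: the function names are hypothetical, the label "poor" for values outside every stated range is an assumption (the text does not name that category), and the example values are the fit indices reported in the Results for the one- and two-factor models.

```python
def classify_rmsea(value: float) -> str:
    """RMSEA: < 0.05 good, < 0.08 acceptable, 0.08 to 0.10 marginal."""
    if value < 0.05:
        return "good"
    if value < 0.08:
        return "acceptable"
    if value <= 0.10:
        return "marginal"
    return "poor"  # label assumed; the text gives no name for values above 0.10


def classify_srmr(value: float) -> str:
    """SRMR: < 0.05 good, < 0.08 acceptable."""
    if value < 0.05:
        return "good"
    if value < 0.08:
        return "acceptable"
    return "poor"  # assumed label


def classify_incremental(value: float) -> str:
    """CFI or TLI: > 0.95 good, > 0.90 acceptable."""
    if value > 0.95:
        return "good"
    if value > 0.90:
        return "acceptable"
    return "poor"  # assumed label


if __name__ == "__main__":
    # Fit indices reported in the Results section of this article.
    models = {
        "two-factor": {"RMSEA": 0.072, "SRMR": 0.005, "TLI": 0.990, "CFI": 0.998},
        "one-factor": {"RMSEA": 0.203, "SRMR": 0.025, "TLI": 0.919, "CFI": 0.973},
    }
    for name, fit in models.items():
        labels = {
            "RMSEA": classify_rmsea(fit["RMSEA"]),
            "SRMR": classify_srmr(fit["SRMR"]),
            "TLI": classify_incremental(fit["TLI"]),
            "CFI": classify_incremental(fit["CFI"]),
        }
        print(name, labels)
```

Keeping one rule per index makes it explicit that a model can be rated differently by different indices, as happens for the one-factor model.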
Analysis of variance (ANOVA) for characteristics with three or more categories and two-sample t-tests for characteristics with two categories were used to assess the associations of sociodemographic characteristics with the PHQ-4 scale and its subscales of PHQ-2 and GAD-2. We applied the Bonferroni multiple-comparison adjustment to the ANOVA tests to account for multiple testing and to determine which pairs of groups had significantly different scale scores. Additionally, we used Pearson's correlation to assess the intercorrelations of the PHQ-4 scale, PHQ-2, and GAD-2 with the UCLA Loneliness scale (short version) to determine construct validity, specifically convergent validity. We computed 95% CIs for the Pearson's correlation estimates.
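A confidence interval for a Pearson correlation is commonly obtained with the Fisher z-transformation. The text does not state which method the software applied, so the sketch below is an assumption about the approach rather than a description of the authors' procedure; the function name `pearson_ci` is hypothetical. Applied to the correlation of 0.63 reported in the Results with N = 5,140, it gives an interval that rounds to the published one.

```python
import math
from statistics import NormalDist
from typing import Tuple


def pearson_ci(r: float, n: int, level: float = 0.95) -> Tuple[float, float]:
    """Confidence interval for a Pearson correlation via the Fisher z-transformation."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)                 # Fisher z
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


if __name__ == "__main__":
    low, high = pearson_ci(0.63, 5140)
    print(f"95% CI: ({low:.2f}, {high:.2f})")
```

With a sample this large the interval is narrow, which is why even moderate correlations are estimated precisely.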
Table 2 shows the descriptive statistics of the GAD-2, PHQ-4, and PHQ-2, as well as the covariances and reliability. The mean (SD) scores of the PHQ-4, PHQ-2, and GAD-2 were 3.28 (3.67), 1.60 (1.89), and 1.67 (1.97), respectively. The mean item scores ranged from 0.80 to 0.85. Because we conducted a complete case analysis, the analytic data had no missing values, and no multiple or mean imputation was needed [50,51]. The descriptive statistics in Table 2 showed that the skewness and kurtosis values were not extreme: the skewness values were within ± 2, and the kurtosis values were below the |7.0| threshold for high kurtosis [36]. Hence, the item responses do not deviate substantially from normality, and the sample size is large enough to approximate normality and generate robust estimates [36,52,53]. The Q-Q and P-P plots also supported the normality of the distributions (figures not provided).

Dimensionality of PHQ-4: Two-factor model versus one-factor model
The comparison between the one- and two-factor structures of the PHQ-4 is displayed in Table 4. The results revealed strong evidence for the two-factor structure of the PHQ-4. The model fit indices show that the two-factor model fits the data better than the one-factor model (two-factor model: RMSEA = 0.072; SRMR = 0.005; TLI = 0.990; CFI = 0.998 versus one-factor model: RMSEA = 0.203; SRMR = 0.025; TLI = 0.919; CFI = 0.973). These results were further confirmed by the likelihood ratio chi-square test for the model comparison (likelihood ratio chi-square difference [ΔLR χ²] = 397.44, Δdf = 1, p < 0.001). As displayed in Fig. 1, the standardized factor loadings were higher for the PHQ-4 two-factor solution (rs = 0.82 to 0.92) than for the one-factor solution (rs = 0.79 to 0.90). The factor loadings in the two-factor solution imply that 81% (i.e., 0.90²) and 82.81% (i.e., 0.91²) of the variances in items 1 and 2, respectively, are explained by GAD-2, and that 67.24% (i.e., 0.82²) and 84.64% (i.e., 0.92²) of the variances in items 3 and 4, respectively, are explained by PHQ-2. All the factor loadings were high and statistically significant (all p < 0.001). The two factors (i.e., anxiety and depression) were significantly correlated (r = 0.92); thus, 84.64% (i.e., 0.92²) of their variance is shared. We also evaluated the convergent and discriminant validities of the two factors. The AVE values for GAD-2 (AVE = 0.821) and PHQ-2 (AVE = 0.755) were greater than the 0.50 threshold, demonstrating evidence of convergent validity. The AVE values were, however, less than the SC between the two factors (SC = 0.841), indicating no evidence of discriminant validity. The composite reliability values for the two factors (GAD-2: Jöreskog's rho = 0.90; PHQ-2: Jöreskog's rho = 0.86) were also higher than the AVE values for each factor, which supports the construct reliability.
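The AVE, composite reliability, and discriminant validity check reported above can be approximated from the standardized loadings alone. The sketch below uses the usual formulas (AVE as the mean squared loading, and Jöreskog's rho as the squared sum of loadings divided by that quantity plus the summed error variances); the function names are hypothetical, and the rounded loadings from the text are used as inputs, so the AVE values differ slightly in the third decimal from those reported, which were presumably computed from unrounded estimates.

```python
from typing import Sequence


def average_variance_extracted(loadings: Sequence[float]) -> float:
    """Mean of the squared standardized loadings of one factor."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def composite_reliability(loadings: Sequence[float]) -> float:
    """Joreskog's rho for standardized loadings with uncorrelated errors."""
    squared_sum = sum(loadings) ** 2
    error_var = sum(1 - l ** 2 for l in loadings)
    return squared_sum / (squared_sum + error_var)


def has_discriminant_validity(ave: float, squared_corr: float) -> bool:
    """Criterion described in the text: AVE must be at least the squared correlation."""
    return ave >= squared_corr


if __name__ == "__main__":
    gad2 = (0.90, 0.91)   # rounded standardized loadings, items 1 and 2
    phq2 = (0.82, 0.92)   # rounded standardized loadings, items 3 and 4
    sc = 0.841            # squared correlation between the factors, as reported
    for name, loads in (("GAD-2", gad2), ("PHQ-2", phq2)):
        ave = average_variance_extracted(loads)
        rho = composite_reliability(loads)
        print(name, round(ave, 3), round(rho, 2), has_discriminant_validity(ave, sc))
```

Both AVE values exceed 0.50 but fall below the squared correlation, reproducing the pattern of convergent validity without discriminant validity.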

Multiple group confirmatory factor analysis models
The results for the single-group CFA and the multiple-group CFA are shown in Table 5. The single-group CFA showed evidence of good model fit for both the US-born and foreign-born groups, but the model appeared to fit better in the foreign-born group than in the US-born group. The MCFA results indicated that all the measurement invariance models fit the data well. The configural invariance model revealed similar factor structures in the two nativity groups. The metric invariance model also showed that the factor loadings are equal across the two nativity groups. When the metric model was compared with the configural model, the test was not statistically significant (ΔLR χ² = 3.49, Δdf = 3, p = 0.322), indicating metric invariance. Given the evidence of metric invariance, we proceeded to the scalar invariance model. The scalar invariance model fit the data just as well as the metric invariance model (ΔLR χ² = 1.34, Δdf = 3, p = 0.719); the assumption of equal item intercepts holds, and therefore neither group consistently scores higher on the items than the other, adjusting for the latent construct. The practical goodness-of-fit indices (i.e., RMSEA, SRMR, TLI, and CFI) also showed no worsening of fit when the additional invariance constraints were applied.
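The p-values of the likelihood ratio tests follow from the chi-square distribution with Δdf degrees of freedom. The sketch below evaluates the upper-tail probability with a closed-form expression that holds for integer degrees of freedom, using only the Python standard library; the function name `chi2_sf` is hypothetical, and this is a check on the reported values rather than the Mplus or STATA computation itself.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term = 1.0
        total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i - 1/2) / Gamma(i + 1/2)
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


if __name__ == "__main__":
    # Chi-square differences and degrees of freedom reported in this article.
    print("metric vs. configural:", round(chi2_sf(3.49, 3), 3))
    print("scalar vs. metric:", round(chi2_sf(1.34, 3), 3))
    print("one- vs. two-factor below 0.001:", chi2_sf(397.44, 1) < 0.001)
```

Because the chi-square differences in the text are rounded to two decimals, the recomputed p-values can differ from the reported ones in the third decimal place.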

Construct validity (Convergent validity)
The intercorrelation between the PHQ-4 total scale score and the UCLA Loneliness scale (short version) was 0.63 (95% CI: 0.61, 0.65). The intercorrelations of the total and subscale scores with the loneliness scale were statistically significant (all p < 0.001), implying good convergent validity of the PHQ-4 scale and its subscales. Table 6 presents the associations of the PHQ-4 scale and subscale scores with the sociodemographic characteristics of the general sample. This evaluation assessed the validity of the constructs (PHQ-4 and its subscales) using the independent samples t-test and ANOVA to determine whether the scale scores differ among participants by sociodemographic characteristics. The PHQ-4 scale and its subscale scores were significantly associated with all the sociodemographic characteristics (all p < 0.001), suggesting differences in the scale scores between groups. The following interpretations of the mean scores are not based on p-values for the pairwise comparisons. US-born individuals had higher depression (i.e., PHQ-2), anxiety (i.e., GAD-2), and anxiety/depression (i.e., PHQ-4) scores than foreign-born individuals. Scores were also higher for younger adults aged 18-25 or 26-34 years than for older adults, and for individuals who identified as non-binary, transgender, or other than for those who identified as male or female. Lesbian or gay and bisexual individuals, especially bisexual individuals, had higher scale scores than heterosexual/straight individuals. Latinos/Hispanics and other racial groups scored higher on the scales than Black/African, White, and Asian Americans. Participants with lower educational levels had higher scale scores than those with higher educational levels, and scores were higher for those who were currently unemployed, in unpaid work, a voluntary job, or an apprenticeship, permanently sick or disabled, or students than for those who were employed or retired. Participants who were never married scored higher on the scales than those who were divorced, widowed, separated, or married, and scores were higher for those with an annual household income of less than $35,000 than for those with an income higher than $35,000.
Table 6 note: Letters a to e mark group means whose differences were not statistically significant in the Bonferroni multiple-comparison test following the ANOVA tests. Thus, groups with letter combinations (e.g., ab, ac, cd) have no statistically significant mean differences, whereas groups without letter combinations have statistically significant mean differences. Differences between two groups were examined with a two-sample t-test, and differences among three or more groups were evaluated using the ANOVA test.

Discussion
Previous studies examined the psychometric properties of the PHQ-4 scale among Hispanic Americans [6], Colombians [17], Germans [13], patients in the U.S. [5,11,12], and U.S. college students [2]. However, our study was the first to evaluate the psychometric properties of the PHQ-4 scale in the general U.S. adult population and to compare the PHQ-4 scale across US-born and foreign-born Americans. Our findings provided evidence for the PHQ-4 scale, especially its two-factor structure, as a reliable and valid self-administered measure of anxiety and depression symptoms in the general U.S. adult population. The findings also demonstrated high internal consistency of the PHQ-4 scale (α = 0.92). The CFA results showed that the model fit the data well for both the US-born and foreign-born groups, but it fit better in the foreign-born group. Further examination of the factor invariance of the PHQ-4 scale across groups (US-born vs. foreign-born) using MCFA revealed that the PHQ-4 could be used to assess anxiety and depression symptoms equally across US-born and foreign-born Americans, indicating that PHQ-4 scores can be compared across these two nativity groups. Consistent with previous findings [6,13], our results supported the two-dimensional structure (i.e., GAD-2 vs. PHQ-2) over the one-dimensional structure (i.e., the PHQ-4 total score) of the PHQ-4 scale. These findings affirm the use of the PHQ-4 scale as a two-dimensional instrument to measure anxiety and depression symptoms in the general U.S. adult population. The internal consistencies of the PHQ-4 scale (α = 0.92), PHQ-2 scale (α = 0.86), and GAD-2 scale (α = 0.90) in our study are higher than those reported in the general populations of Germany (PHQ-4: α = 0.82; PHQ-2: α = 0.78; and GAD-2: α = 0.75) and Colombia (PHQ-4: α = 0.86; PHQ-2: α = 0.83; and GAD-2: α = 0.79) by Löwe et al. [13] and Sanabria-Mazo et al. [17], respectively. The findings, therefore, indicate higher reliability of the PHQ-4 scale and its subscales in the general U.S. population compared to the general populations of Germany and Colombia.
Existing studies indicated that anxiety and depression frequently co-occur and that their comorbidity increases the severity of disability [1,5,6,56]. Similarly, our findings revealed a high intercorrelation between the PHQ-2 and GAD-2 subscales (r = 0.92), which is higher than those reported in the German (r = 0.79) and Colombian (r = 0.68) studies. The two-dimensional factor loadings in our study (rs = 0.82 to 0.92) were also higher than those reported in the German (rs = 0.73 to 0.87) and the Colombian (rs = 0.71 to 0.92) studies. The high correlation between anxiety and depression in our study implies that the two factors have weak discriminant validity but strong convergent validity [12,13]. Although these measures are conceptualized as different, their high correlation works against their discriminant validity [12,13]. Thus, the two latent constructs did not share more variance with their own indicators than with each other. Consequently, the two measures may not be distinguishable from each other; using only the PHQ-2 or the GAD-2 may not fully assess the symptoms of anxiety or depression [12,13]. Additional studies, especially longitudinal and clinical studies, are needed to further examine the stability of the discriminant and convergent validities and the effectiveness of the two scales in screening for anxiety and depression symptoms in the general population.
It should, however, be noted that while our study was conducted among adults aged 18 years or more (N = 5,140) between May 2021 and January 2022, Sanabria-Mazo et al. [17] conducted their study among adults aged ≥ 18 years (N = 18,061) in Colombia between May and June 2020. Although both studies were conducted during the COVID-19 pandemic, the Colombian study took place during the initial phase, while our study took place during the later phases of the pandemic. The higher reliabilities of the scales in our study could be due to the negative impact of the pandemic on the severity of anxiety and depression symptoms during its later phases, as mental health symptoms, coupled with unemployment, social distancing stress, high rental costs, and inflation, worsened during the pandemic [1,4,[57][58][59]. Thus, mental health symptoms increased during the pandemic, and higher scores on the scale items could therefore result in higher inter-relatedness among the items, leading to higher reliability estimates [60]. Similarly, Löwe et al. [13] conducted their study among individuals aged ≥ 14 years (N = 5036) in Germany between May and June 2006, about 14 years before the pandemic, and reported lower reliabilities of the scales than those reported in Colombia and in our study during the initial and later phases of the pandemic, respectively. Additionally, the differences in the internal consistencies could be attributed to the inclusion of younger participants in the German study compared with the adult samples in our study and the Colombian study.
Our results showed that the two-factor structure of the PHQ-4 was invariant across the two nativity groups (US-born vs. foreign-born individuals) in this study, although the model demonstrated better fit in the foreign-born group (RMSEA = 0.128, SRMR = 0.018, TLI = 0.964, and CFI = 0.988) than in the US-born group (RMSEA = 0.226, SRMR = 0.027, TLI = 0.902, and CFI = 0.967). That is, the factor structure of the PHQ-4 is comparable across nativity groups, but the PHQ-4 scale might screen anxiety and depression symptoms more accurately in the foreign-born population than in the US-born population. However, the RMSEA values for both nativity groups in the single-group analyses exceeded the 0.1 upper bound for marginal fit. These high RMSEA values could be due to the use of the ML estimator, which produces higher RMSEA values than other estimators (e.g., maximum likelihood with robust standard errors [MLR], weighted least squares [WLS], weighted least squares mean adjusted [WLSM], weighted least squares mean-variance adjusted [WLSMV], unweighted least squares [ULS], and diagonally weighted least squares [DWLS]) [61][62][63]. RMSEA values are based on a fit function specific to an estimator [62]. The RMSEA is a function of the χ² statistic, and the choice of estimator significantly influences the χ² test of model fit and indices derived from it, such as the RMSEA [62]. Nonetheless, the other model fit indices we assessed supported the adequacy of the models; therefore, the high RMSEA values alone do not invalidate the CFA models.
Consistent with earlier studies [6,13,17,22,29], our findings demonstrated the construct validity, specifically the convergent validity, of the PHQ-4 and its subscales of PHQ-2 and GAD-2 in the general U.S. population. We found significant positive associations of the PHQ-4 scale and its subscales with the measure of loneliness (i.e., the UCLA Loneliness scale, short version), denoting good convergent validity of the PHQ-4 scale and its subscales. The findings also imply that individuals who scored highly on the loneliness scale also scored highly on the PHQ-4 and its subscales. Therefore, loneliness might increase the risk of experiencing anxiety or depression symptoms, especially during the pandemic, when social distancing rules were enforced and social gatherings and travel were restricted [23]. The scales were also associated with sociodemographic factors, including nativity, age, race/ethnicity, sexual and gender identity, level of education, marital status, employment, and income, which contribute to differences in anxiety and depression [6,13,22,29]. Similar to previous studies [64][65][66], we observed that US-born individuals, younger adults aged 18-34 years, gender minority (i.e., non-binary/transgender/other) and sexual minority (i.e., lesbian or gay, bisexual, and others) individuals, Latinos/Hispanics and other racial groups (Black/African Americans and White Americans had similar scores), individuals with lower educational and income levels, unemployed individuals, and never-married individuals had higher anxiety (i.e., GAD-2), depression (i.e., PHQ-2), and total (i.e., PHQ-4) scores than their counterparts. The evidence further suggests that the effect sizes for the PHQ-4 scores were larger than those for the subscale scores across all the sociodemographic factors. The larger effect sizes support using the PHQ-4 scale rather than the subscales alone to assess anxiety and depression symptoms as comorbid symptoms. Individuals who experience anxiety often also experience depression, and their combined scores can therefore be higher than their scores on either subscale, necessitating the use of the two-factor measure or PHQ-4 scale to assess anxiety and depression symptoms [1,5,6,56].
Despite the strength of this study in using a large sample to evaluate the reliability and validity of the PHQ-4 scale and its subscales in the general U.S. adult population, the following limitations should be considered. First, because this is a cross-sectional study, we could not examine the stability of the measures in the same population over time. Second, the screening instruments (i.e., the PHQ-4 and its subscales) assess symptoms, not clinical diagnoses; we did not conduct clinical assessments to determine the presence or absence of clinical disorders (e.g., GAD and MDD) for clinical treatment. Third, because we performed an unweighted analysis, the generalizability of our findings is limited to the sample of the general population we analyzed. Fourth, the self-reported responses could have resulted in misclassification or response biases, which often lead to underestimation of health behaviors and mental health, including symptoms of anxiety and depression. Finally, the lack of discriminant validity in our study may suggest limited evidence of construct validity, because only convergent validity was established despite the two measures being conceptualized as distinct [12,13].

Conclusions
To date, few studies have examined the reliability and validity of the PHQ-4 scale and its subscales in general populations. Notably, few, if any, studies have examined these scales' psychometric properties in the general U.S. population. Our study added evidence that the PHQ-4 is a reliable and valid instrument for assessing anxiety and depression symptoms in the general adult population. The two-factor structure of the PHQ-4 is a better and more efficient instrument than the one-factor structure, which further supports the evidence of the comorbidity of anxiety and depression symptoms and the need to use the two-factor measure [1,5,6,56]. The two-factor structure was also consistent across the US-born and foreign-born groups in this study, confirming this instrument as a potentially efficient rapid mass screener for anxiety and depression symptoms in the general U.S. population. Given the limited public health resources (e.g., time, personnel, cost) for assessing anxiety and depression in clinical settings, widespread implementation of this instrument, especially during pandemics, can provide rapid information on anxiety and depression symptoms in the general population, as these symptoms often affect health behaviors and decisions. Future studies may examine the test-retest reliability of the PHQ-4 scale and its subscales to determine their stability over time in the same general population and to provide further evidence for the consistency of these measures. Finally, other subgroup differences in the psychometric properties of the PHQ-4 and its subscales should be evaluated to determine their consistency across major risk groups (e.g., sexual and gender minority and racial/ethnic minority individuals) as part of ongoing efforts to improve research that addresses health disparities.

Fig. 1 Confirmatory factor analyses' estimates for the PHQ-4 one- and two-factor models

Table 5
Multigroup confirmatory factor analysis (MCFA) in two groups of US-born and foreign-born (N = 5,140)